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Abstract 

This paper considers the recovery of a rank r positive semidefinite matrix G 

from m scalar measurements of the form yi := afXX^ai (i.e., quadratic measurements of 
V). Such problems arise in a variety of applications, including covariance sketching of high¬ 
dimensional data streams, quadratic regression, quantum state tomography, among others. A 
natural approach to this problem is to minimize the loss function /([/) = ~ afUU’^ai)'^ 

which has an entire manifold of solutions given by {XO}oeOr- where Or is the orthogonal group 
of r X r orthogonal matrices; this is non-convex in the n x r matrix U, but methods like gradient 
descent are simple and easy to implement (as compared to semidefinite relaxation approaches). 

In this paper we show that once we have m > C'nrlog^(n) samples from isotropic gaussian 
Oi, with high probability (a) this function admits a dimension-independent region of local strong 
convexity on lines perpendicular to the solution manifold, and (b) with an additional polynomial 
factor of r samples, a simple spectral initialization will land within the region of convexity with 
high probability. Together, this implies that gradient descent with initialization (but no re¬ 
sampling) will converge linearly to the correct V, up to an orthogonal transformation. We 
believe that this general technique (local convexity reachable by spectral initialization) should 
prove applicable to a broader class of nonconvex optimization problems. 


1 Introduction 


Consider X G fixed and unknown, acquired through quadratic measurements of the form 

yi := aJXX'^ai = ix{XX'^aiaJ) = ||V^aj|||, i = 1.. . m. Assuming X has full column rank, this 
is equivalent to receiving noiseless samples of Mai of the positive semidefinite matrix M := XX'^. 
Scenarios such as this arise in various applications: for a concrete example, suppose we receive 
a stream of high-dimensional centered Gaussian vectors {xj}^^ with unknown covariance matrix 
S. If we believe S to be (approximately) low rank, we can recover this structure via the sampled 
matrix 

S = — X XjxJ. 
j 


However, due to the large dimensionality, storing all of the incoming vectors Xi might be prohibitive. 
Instead, we randomly draw a set of sensing vectors {a^} which are efficient to store (e.g., they are 
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sparse) and for each incoming data point compute pki = We are now only storing ak)k,i 

which is a sparse data set. Note that if we define 
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Vk := 


m 


then Hk has the form 



3 


The question posed above is: can we compute the covariance structure of the {x*} given only this 
data? 


The example above describes covariance sketching of high-dimensional data streams [DSBN12, 
CCG13], but there are many other scenarios that fall under our problem setting, e.g., phaseless 
measurements in physics and optics [RBM94, TLOB12, Ger72, Fie82]. Because this data is invariant 
under the transformation 


XO 


for any orthogonal matrix O G we can only hope to recover X up to this action. In the 

complex rank one setting x G C”, this means we can only recover x up to a global phase. Phase 
retrieval problems of this type often arise in the physical sciences due to the nature of optical 
sensors, which can only record intensity information [Ger72, Fie82]. It is now well-understood 
that if the measurement vectors a* are generic or e.g. independent Gaussian random vectors, 
then m > 0{n) measurements suffice for injectivity of the map x i—>■ (|(aj,x)p)™^ up to phase 
[BGE06, EM14] . There is still a question of how to perform the inverse map in an efficient and stable 
manner, and in recent years several different algorithms have been proposed in this direction, see for 
example [BBGE09, BalI2, NJS13, CSV13, GL14, ABEM14, DH14, EM14, CESV15]. In particular, 
[BBCE09] noted that such measurements may be reformulated as y* = tr (aja*xx*), so that one 
can consider this problem as that of recovering an unknown rank-one positive semi-definite matrix. 
Inspired by the field of compressive sensing [CT06, Don06] and low-rank matrix recovery [RFPIO], 
this led to many results demonstrating that well chosen convex and semidefinite programming 
(SDP) relaxations can provably recover the underlying signal up to phase with only m > Cn 
Gaussian measurements [GMPll, CSV13, WdMI5]. However, as such algorithms optimize over the 
“lifted” space of n x n positive semidefinite matrices, the computational complexity becomes quite 
high. In the more general rank-r setting, whereby the measurements are y^ := ti{XX'^aiaf) = 
||A^Oi|||, the recent works [KRT14, GGG13] demonstrate that convex relaxation techniques based 
on nuclear norm minimization can solve such problems from an optimal number of measurements 
0{nr), but still require large computational cost. 

In the rank-1 setting in particular, several alternative reconstruction algorithms have been 
proposed with global phase recovery guarantees which operate directly on the lower-dimensional 
problem, and thus are more computationally efficient. Notably, [NJS13] considers the nonconvex 
optimization problem 



( 1 ) 


2 = 1 


and proves that after a judiciously chosen initialization, with high probability alternating minimiza¬ 
tion will converge to the underlying vector x up to phase, assuming random Gaussian measurements. 
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Subsequently [CLS14] used the same initialization to show convergence when followed by gradient 
descent without requiring resampling. Both of these algorithms provably recover the underlying 
vector X up to global phase, from a number of measurements m which is optimal up to additional 
logarithmic factors in re. Very recently, the paper [CC15] provides a modified gradient method 
which removes the additional logarithmic factors of re in the number of measurements. 

In a similar vein, many recent works have demonstrated global convergence guarantees for 
gradient descent on other nonconvex matrix factorization problems. Specifically, in [ZB 15] the 
authors consider gradient descent on the Grassmannian and prove global convergence for a class of 
SVD problems. In [DSOR14] a stochastic gradient algorithm was shown to converge globally for a 
low-rank matrix least squares problem. In [SQW15] the authors consider the recovery of a full-rank 
matrix from sparse linear measurements via manifold optimization over the sphere. In all of these 
works including ours, the underlying idea is that the lack of convexity can be fixed by operating 
on an appropriate matrix manifold. 

In this paper, we consider the more general version of problem (1) in which the underlying 
matrix is of rank r: 


1 

mm - 
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2 = 1 




( 2 ) 


and our measurements take the form 


Vi := tr{XX’^aiaf) = \\X'^ai\\l. 

As noted in [CSV13], it seems unlikely that a deterministic RIP condition holds in this setting. In 
any case, our local convexity results are novel and might shed light on other nonconvex problems 
unrelated to matrix recovery. 

Note that throughout the paper we assume that X has full rank, and without loss of generality 
we can assume its columns are orthogonal. In contrast to previous algorithms operating directly 
on the rank one problem (1), we demonstrate that, under a Gaussian assumption on the random 
measurement vectors, the function (2) is strongly convex in certain directions, and we can recover X 
(up to an orthogonal matrix) via spectral initialization followed by gradient descent. In particular, 
we prove gradient descent will converge linearly at a rate which depends on the condition number 
Ar/Ai of the full matrix XX^ and the ambient dimension re. Moreover, the classical phase retrieval 
problem x G C"" can be recast as a rank 2 recovery problem within our framework (see §3.4). 
Precisely, we show the following all hold with high probability: 

• for general r, we demonstrate that after rre > C'rer(logre)^ Gaussian samples, in a quantifiable 
region the function (2) is strongly convex in directions perpendicular to the manifold of 
solutions 

• The size of this region is independent of both the ambient dimension and the rank 

• with an additional factor of r® samples, a simple spectral initialization will land within this 
region with high probability and thus standard gradient descent on (2) will linearly converge 
to a global minimizer 

• if r = 1 and x G M” is “not too peaky” , then for more general sub-gaussian measurements, 
after rre > Cre(logre)^ subgaussian samples, the function appearing in (2) is strongly convex 
in a ball centered at x (and —x), whose radius we explicitly compute. 
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In the real-valued rank one setting, the strong convexity result we present actually holds in much 
more generality than the initialization result - for sub-gaussian measurements - and we believe 
this should be of independent interest; in particular, our results hold for Bernoulli measurements 
and Sparse Gaussian measurements. We note that in the rank-1 setting, recovery results from 
general sub-gaussian measurements were also provided in [KL15] using convex optimization for 
reconstruction, and a similar incoherence condition on the underlying x was also required there. 

While preparing this manuscript, we became aware of [[Solid], p.250] which also certifies local 
convexity of the function (2), for the special case of Gaussian measurements in the rank one setting. 

Our results can be viewed as exact recovery guarantees for a special case of a manifold- 
constrained least squares problem where the manifold is the set of rank r positive semidefinite 
matrices. This algorithm was studied empirically in [FM15]. Many nonconvex problems of interest 
can be reformulated as a manifold-constrained least squares problem, and we believe that the exact 
recovery guarantees presented here should be extendable to a broader class of problems. 


2 Main results 


We aim to solve the nonconvex inverse problem of recovering the unknown matrix X G (up 

to right multiplication by an orthogonal matrix) from quadratic measurements of the form 


Vi ■= \\aJX\\2 


by solving the nonconvex optimization problem 

^ m 

min f(U) := min -— > (yi 

TaWnXr [/gjjnxr 4?77, ^ ^ 




- \\a-. 




2 = 1 


(3) 

(4) 


Because the function appearing in (4) is invariant under right multiplication by an orthogonal 
matrix, there is an entire manifold of solutions given by {XO : O G 0{r)} where 0{r) is the set of 
r X r orthogonal matrices. Our strategy is to establish that a spectral initialization will land (with 
high probability) in a region of strong convexity^ around the manifold of global minimizers. An 
overview of our approach is given in Algorithm 1. 

There are two main ingredients to proving performance guarantees for Algorithm 1, namely, the 
strong convexity of the function / in a region around the manifold of global minimizers at finite 
sample complexity, and a guarantee that spectral initialization will land within this region. The 
finite sample convexity result holds in more generality when r = 1, while for general r we always 
assume Gaussian measurements. 


2.1 Rank-r Matrix Recovery 

We focus on the setting where X G is a fixed unknown matrix with orthogonal columns, and 
we receive noiseless samples of the form 

Vi ■= (5) 

^Strong convexity here and throughout always refers to convexity in directions orthogonal to the manifold of 
solutions. 
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Algorithm 1 Initialize and Descend for finding global solutions to (4). 

Input: Measurements yi = ||a^A|p, i = 1,2,... ,m, where a* are Gaussian 

Initialize: Uq = where the columns of Z contain the (t' 2 -normalized) eigenvectors corre¬ 

sponding to the r largest eigenvalues cji > ... > Ur of the matrix 


M : = 


1 

2m 


m 

i=l 


and the scaling is given by the diagonal matrix 




Descend: Starting at Uq, iteratively update U via gradient descent on f{U). 
Output: Estimated global solution X to the nonconvex problem (12). 


for i.i.d. standard Gaussian vectors Now that we have an entire manifold of solutions 

given by {XO : O G 0{r)} we will need to consider the quantity 

diU):= mm \\XO-U\\l (6) 

0£0(r) 

which is well-defined by compactness of the orthogonal group. We note that the minimizer may 
not be unique, but this is not important for our purposes. We will also need to consider 




(7) 


the non-zero eigenvalues of the positive semidefinite matrix XX"^. For a given matrix U G the 
notation stands for the Hessian of the function (4), which is a random nr x nr symmetric 

matrix. 

The main finite sample convexity result is as follows: 

Theorem 2.1 (Strong Convexity). Suppose we take m > C'(Ar/Ai)“^nr(logn)^ samples of the 
form (5), where Ai and Xr are as in (7); assume further that U G satisfies 


min IIAO — U\\f < 

oeci(r) 


3Xr 

loiixiif- 


( 8 ) 


Then with probability at least 1 — 4e 1 jrr?, it holds that 


vec(D - XO*fV^f{U)Yec{U - XO*) > - XO*fF 

-Lo||A II jp 

vec{U - XO*)'^X^f{U)vec{U - XO*) < C + AJ \\U - XO*\\l 


where O* is a minimizer for (8) . 
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For the proof of this theorem, we refer the reader to Section §3.1 below. Note that this theorem 
implies that for matrices U G close to the manifold of solutions, we can control the eigenvalues 
of the Hessian; in particular for such U, the function f{U) is strongly and uniformly convex on the 
line connecting U to its nearest point on the manifold of solutions, as measured by the function (6). 

We now show how this local strong convexity results in linear convergence to the true X we 
seek to recover. We have the following theorem which concisely establishes the initialization and 
performance guarantees of Algorithm 1. 

Theorem 2.2 (Main Theorem). Suppose we take m > C\\X\\^pXf'^nr‘^{\ogn)‘^ samples of the form 
(5), where Ai and Xr are as in (7). Define the matrix 


^ m 


and the following associated quantities: 

U := [ui U2 


2 = 1 


Ur 


cji 0 ... 0 

S := 0 (72 ••• 0 

. 0 ar 

Uo := 


CT 7*-j- ]_ 7 d 


where cii > ct 2 ... > ar+i > 0 are the eigenvalues of M and Ui are the corresponding normalized 
eigenvectors. If we iteratively update via gradient descent 

C/fc+i = f/fc - 7V/(t/), 

then with probability at least 1 — 3e“'’” — Ijrnf, 

d{Uk) < [1 -277 + 72 ^ 2 ] 

where 

n 

mx\\i 

B = Cn^ \og{nr^)‘^Xr 

and d{Uk) is given by ( 6 ). In particular, d{Uk) converges to zero geometrically as long as^ < llj. 

For a proof of Theorem 2.2, see Section 3.3. 

A few remarks are in order. 


1. Note that the quantity ||X|||,A^^ is scale invariant; however, we have the bounds 

< r'^{\i/\r)^. 

2. One consequence of our result is that the sampling complexity is entirely independent of the 
desired solution tolerance. That is, the fixed set of m > Cnr®(logn)2 samples suffices to 
produce a global solution up to arbitrary accuracy. 


6 





3. Our numerical results in §4 suggest that in general the sampling complexity only linearly 
depends on the ambient dimension n. Consequently a more refined analysis and initialization 
procedure such as that found in the recent work of [CC15] for the case of rank-1 recovery is 
most likely possible also in the general rank-r recovery setting. 

4. This method of analysis should find use in providing recovery guarantees by gradient descent 
for a broader class of nonconvex problems arising in machine learning applications such as 
matrix completion, nonnegative matrix factorization, clustering, etc. More generally, such an 
analysis could possibly be useful towards achieving provable guarantees for machine learning 
problems which have many unstable saddle points, such as neural networks [DPG^14]. 

2.2 Rank-One Matrix Recovery 

In the rank one setting where x G M”", we can provide local convexity guarantees holding more gen¬ 
erally for sub-gaussian measurements, subject to appropriate incoherence conditions on x. Suppose 
we receive noiseless samples of the form yi := (af x)'^ for i.i.d. sub-gaussian vectors which 

we assume satisfy: 


Ek] = 0 
EkUj J = E 

where S is the covariance matrix, which we assume is invertible. With this setup, we then consider 
minimization of the random function 

■ ( 12 ) 

2=1 

Note that here the Hessian is given by 

- m 

VV(k = — {yajuf - {ajxf ) ataf. (13) 

2=1 

We have the following convexity theorem: 

Theorem 2.3 (Strong convexity). Letx G M" and {ai}^^ be i.i.d. sub-gaussian satisfying K[ai] = 0 
and E[aja^] = S. With V‘^f{x) as in (13), define 

A:= Xmin (E [v2/(k])- 

If m > C\\T,\\^^n{\.ognfi, then with probability greater than 1 — 4/n^, 

(U - xfV^f{u){u -X)> 

holds uniformly for all u G M"" in the ellipse around x defined by 

- 30||SV2x||2' 

Above, C > 0 is a eonstant which depends only on the sub-gaussian norm of Oi. 



7 




For a broad class of sub-gaussian measurements, we can provide an explicit lower bound on 
Xmin (lE [V^/(x)]) thereby establishing a quantitative bound on the strong convexity parameter. 
In the case of Gaussian measurements in particular, such a bound holds independent of x. For 
more general sub-gaussian measurements, additional incoherence constraints on x - that x not be 
“too peaky” - are required for strong convexity. See Lemma 3.9 for details. While preparing this 
paper we became aware of related results in the thesis [Solid] which demonstrate a similar lower 
bound on the Hessian in the case of Gaussian measurements. 

The finite sample convexity result holds for general sub-gaussian measurements satisfying (22), 
while our initialization results require more restrictive conditions, namely that the fourth moment 
of the measurements is close to that of Gaussian measurements; for simplicity we have only included 
the result for Gaussians which follows from Lemma 3.11 in the next section. 

The rest of the paper is organized as follows: in §3.1 we prove the main finite sample convexity 
result Theorem 2.1, which relies on classifying tangent and normal directions to the manifold of 
solutions {XO : O G 0{r)} and an explicit formula for the expected Hessian. In §3.2 we prove 
convexity results for the rank one case under more general randomness assumptions. In §3.3 
we prove that with high probability the initialization step produces a matrix in a convex region 
around the manifold of solutions and establish the convergence of gradient descent. Briefly in §3.4 
we describe how our results generalize to the complex setting. Finally, in §4 we conclude with some 
numerical experiments demonstrating the performance and robustness of the results presented here. 


3 Convexity 

3.1 General Low-Rank 


Here we present lemmas that are used in the proof of Theorem 2.1, as well as a summary of the 
proof. For the full proof, we refer the reader to Section 5.1.3. 

The main lemma we rely on is the following simple characterization of the normal directions to 
the manifold of solutions: 


Lemma 3.1. Assume X has full column rank and let O* 
necessarily unique. Then we can write 


= arg nim ||XO-C/||^, 
OeC>(r) 


which is not 


UO*^ = X{X'^X)-'^M + P±U 


where M G is a symmetric positive semidefinite matrix and P± is the projection onto the 

orthogonal complement of the column space of X. 

Proof. This basically follows from the solution to the Orthogonal Procrustes Problem [Sch66]. If 
we write X'^U = ZDV^ for the singular value decomposition of X"^U, then we can expand the 


objective as follows: 


arg mm \\X0 - Ufp = arg mm \\X\\l + \\U\\j, - 2{X0, U)f 

OeO(r) OeO(r) 

= arg max {XO,U)f 

OeO(r) 

= arg max (O, ZDV'^)f 
OeC>(r) 

= Z I arg max {0',D)f] V'^ 

V O'eO(r) J 

= ZV'^. 


We then find that 

X^UO*'^ = ZDZ^ 


is a symmetric positive semidefinite matrix. As X'^A = 0 is equivalent to P±A = A, we arrive at 
the stated claim. □ 


This lemma says that if we consider the direction W = U — XO* between U and its closest solution 
matrix XO* we have that 

0*'^X^W = O*^ (M - X'^X) O* 

which is a symmetric matrix. Why symmetry is important will become apparent after the next 
lemma, which establishes formulas for the expectation of the Hessian of (4): 

Lemma 3.2. The gradient of f{U) = ~ aJUU'^ai)'^ is given by 

V/([/) = [V/i(C/) ... V/,([/)] G (17) 


where 


^ m 

^fk{U) = — '^{aJUU^ai - yi){a'[uk)ai 


(18) 


2=1 


and the Hessian of f{U) is given by 

{aJUU'^ai + 2{ajuif - yi)aiaj 
2{afui){afur)aiaf 


vV(c/) = -E 


2=1 


{afUU'^ai + 2{afur)‘^ - yi)aia. 


(19) 


Moreover, if we suppose the at’s are i.i.d. centered Gaussian random vectors satisfying E[aa^] = Id, 
then the expectation of (19) is given by 

E[VV(A)]=A + D (20) 

where the nr x nr block matrices A and D satisfy 

Aij = [2uiu^ + 2ujuJ + 2{uJuj)Id]_^^^ (21a) 

D,, = [{\\UrF-\\XrF)Id + 2{UU^ - XX'^)]^^^. (21b) 

For details, see §5.1.1 in the Appendix. We will also need a standard concentration result: 
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Lemma 3.3. Suppose we collect m > C5 ^/3nrlog(nr) samples of the form yi := af XX'^Oi, where 
5 and fd are given constants and r = rank(X); then we have that with probability greater than 
1 _ 2e-/3™ - 6/m2 

\\V^fiX) -E[V^f{X)]\l^<26\\X\\lp. 

The sampling complexity can be improved, but we state Lemma 3.3 as a general proof-of-concept. 
For details see S5.1.2. 


To complete the proof sketch, observe that 

vec([/ - XO*fv‘^f{U)vec{U - XO*) 

can be written as a convex quadratic polynomial in \\U — XO*\\f, where the constant term is given 
by 

vec(C/ - XO*fV^f{XO*)vec{U - XO*) 

and consequently we can bound its smallest positive root using the remarks above (see §3.2.2 for the 
rank one setting, where this observation is more straightforward). We apply the concentration from 
above along with the following one-sided martingale bound from [Ben03] (as stated in [CLS14]) to 
establish the stated non-asymptotic bound. For details see §5.1.3. 

Lemma 3.4. Suppose ¥±,¥ 2 , ..., Tm are i.i.d. real-valued random variables obeying ¥i <b for some 
nonrandom b > 0, E[F)] = 0, and Setting = m ■ max(6^,n^), 

P[Yi + >2 + + Ym>y]< minjexp >co(l - 4>(y/cr))| 

where one can take cq = 25 and <h(-) is the CDF for the standard normal. 

3.2 Rank One 

In this section we restrict our attention to the setting where x G M”' is a fixed unknown vector, and 
we receive noiseless samples of the form yi := {af x)"^ for i.i.d. sub-gaussian vectors which 

we assume satisfy: 


E[aj] = 0 

E[aiaf] = S 


where S is the covariance matrix, which we assume is invertible. Consider the eigenvalue decom¬ 
position of the covariance matrix, S = important quantity in our analysis will 

be 

(23) 


(x) := max (nfcx)^||S^/^x| 


l<k<n 


-2 
2 > 


a coherence parameter for S and 


Hi = Elivlai)^^], 


a 4th moment parameter. 
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3.2.1 Convexity in Expectation 


We consider convexity of the function f{u) defined in (12) (equivalently, positive semi-definiteness 
of the Hessian matrix V^/(u)) in the neighborhood oi u = x, first in expectation with respect to 
the draw of a,, or in the limit of infinitely many samples m. These results are necessary for the 
proof of Lemma 3.9. 

Lemma 3.5. Assume that are centered sub-gaussian random vectors with E[aa'^] = S. 

Assume further that the transformed variables bi := have independent coordinates and 

equal fourth moment parameter |U 4 := Then 

= (3||Si/2^||2 - ||sV2a;||2^ s 

^ (24) 

-|- S — Ixx'^) S -|- {fi4 — 3) ^ (3(uju)^ — (ujx)^) Vkv'^. 

k=l 


For details, see §5.3.1 in the Appendix. We then have the following asymptotic convexity result: 


Lemma 3.6. Let x G M"’ and consider the funetion E[/(tt)] with f{u) defined in (12). Under the 
same assumptions as in Lemma 3.5 above, E[/(n)] is convex in the ellipse 



/! + min(r(x), 1 / 2)[^4 - 
V 3 + t{x)[h4^ - 3]+ 





where t{x) is the coherence ofYfl’^x as in (23), and above [«]_ = min{u,0} and [u]+ = max{u, 0}. 

The proof of Lemma 3.6 relies on the fact that E[/(x — tw)] is a convex quadratic polynomial 
in t. Using this insight, we can actually bound the largest and smallest eigenvalues of the expected 
Hessian whenever we are within a restricted region 

for some J < 1. In fact we find that a loose bound is given by 


Xmax (E[V2/(«)]) < I|S||opI|EV2 


6(5 -|- 2(3 -|- t ( x )[//4 — 3]+) 

y 


Xmin (E[v2/(«)]) > [-25(3 + r(x)[//4 - 3]_) + 2(1 + r(x)[//4 - 3]_)] . 


Remark 3.7. This result alone provides enough information to prove performance guarantees 
for stochastic gradient descent after an initialization procedure. Via a union bound and covering 
argument, this result along with matrix concentration will also guarantee uniform convexity in this 
region at finite sample size m > n^. However, to ensure uniform convexity at finite sample size 
m > C'nlog(n), we will need a more rehned analysis based on the structure of the Hessian matrix, 
as presented in the next section. 
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3.2.2 Non-Asymptotic Convexity 

Here we present the sketch of the proof of Theorem 2.3. For the full proof, we refer the reader to 
Section 5.3.5. 

As before, we will use the standard concentration result: 

Lemma 3.8. Let a: G M" and {ai}^^ be i.i.d. sub-gaussian, satisfying (22). Then there exists a 
constant C depending only on the sub-gaussian norm of Oi such that if m > C'e“^||S||opn(log n)^, 
then with probability greater than 1 — 3fn^ it holds that 

^ m 

— x)^aiaf — E [(a^x)^oo^] 

771 . 

1=1 

This result can be proved by first truncating the norms of the measurements vectors and then 
applying Matrix Bernstein’s Inequality (e.g., [Trol2]). The sampling complexity can be improved, 
but we state Lemma 3.8 as a general proof-of-concept. For details see §5.3.4. For e sufficiently 
small, this result indicates that we can control the eigenvalues of V^/(u) for u sufficiently close to 
X. In particular, if V‘^f{u) is positive definite in a region around x, then f{u) is strongly convex 
and x is the unique minimum in this region. It is not immediately clear how to extend such control 
to a quantifiable region around x. However, Theorem 2.3 requires only that we have a lower bound 
on the eigenvalues. 

Assuming that S = Id, the same technique from §3.1 can be applied: first write u = x -\-tw for 
a unit vector ||r /)||2 = 1 and observe that 


<e||Si/2x||2 


iu — f{u){u — x) = — ‘i{aJw)H^ + Q{(x^w)^iajx)t + 2{ajxw'^aiy 

m 

i=l 

m ^ m 

+ B,)‘ - - 


3 

m 


2=1 


2=1 


using (13), where we have defined 


Ai := (afwf 
Bi := (^afxw'^ai) . 


Note that 

m 

— y] = -(u - xfv^f{x){u - x) 

2 = 1 


and consequently using Lemma 3.8 and Lemma 3.6 we can control this term. As before, applying 
Lemma 3.4 to the positive term 


3 

m 


{Ait + Hj)^ 

2=1 


yields the stated conclusion. 

For a broad class of sub-gaussian measurements, we can provide an explicit lower bound on 
Xmin (E [V^/(x)]) , thereby establishing a quantitative bound on the strong convexity parameter. 
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Lemma 3.9. Suppose that are centered sub-gaussian random vectors with independent 

coordinates, standard covariance 'E[aa'^] = Id, and equal fourth moment parameter /i 4 := IE[a^^]. 
Then 

Amin (IE [V^/(x)]) > 2 (1 + min{T(x), l/ 2 }min{/i 4 - 3,0}) ||x ||2 (27) 

where t{x) = max |, is the coherence of x. 

l<k<n ||x||2 

The proof of this uses the fact that the smallest eigenvalue is a concave function of ^ 4 ; the proof 
can be found in §5.3.2. We can now quantify the lower bound appearing in (15) for a large class of 
sub-gaussian measurements: 

1. Bernoulli: For standard Bernoulli measurement vectors, where Oj^ are i.i.d. ±1 with equal 
probability, /i 4 = 1 and we have a quantifiable strong convexity guarantee so long as x is 
incoherent, i.e., t{x) < 1/2. This is sharp in the sense that for x = [l/\/2 l/\/^ the 
expected Hessian has a 0 eigenvalue. 

2. Gaussian: For vectors a* with i.i.d. standard Gaussian entries, /i 4 = 3 and Lemma 3.9 
provides the uniform lower bound 

VV(n) ^ ^llxlli (28) 

for all ||u — x \\2 < 

3. Sparse Gaussian: Note that (28) holds anytime ^4 > 3 by Lemma 3.9. This includes sparse 
Gaussian vectors, whose coordinates are i.i.d. standard normal with probability p and 0 with 
probability 1 — p. In this case 

/U 4 = 3/p. 

3.3 Initialization and Gradient Descent 

We have shown that the function is strongly convex in a quantifiable region around the global 
minimizers. To guarantee results for gradient descent, we will also need the following lemma which 
bounds the Lipschitz constant of the gradient of our function. 

Lemma 3.10. Consider the function f{U) = ~ Ikr^lli)^- Suppose m> C. For a 

universal constant C > 0, it holds with probability exceeding 1 — 2m~^ that for any U within the 
region of convexity given by (8), 


||V/(H)-V/(X)||^ ^ 

\\U-XO*\\f 

with B = Cn^ log(m)^Ar-. Here, Ai > A2 > • • • > Ar are the eigenvalues of XX'^ and \\U—XO*\\‘jp = 

minoeOM 11-^0 - 

Proof. By the sub-gaussian assumption, the following holds with probability exceeding 1 — 2m~^ 

max ||ai ||2 < C'nlog(m). 

l<i<m 
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Conditioning on this event, recalling that ||-^|||’ = Ai + • • • + A^, and recalling the formula for the 
gradient V/ in (18), observe the bound 

l|V/fe(17) - V/fc(X)||2 = \\Vfk{U)\\2 < max||ai||2 \{afUU^ai - yi){afuk)\ 

I 

< \/C'n log(m) max \{aj{UU'^ — XO*X'^)ai){aJ— ffc))! 

i 

< iCnlog{m)f\\uk - VkhWU - XO*\\f{\\U\\f + ||X||ir) 

< C'(nlog(m))^||nfc - (-[p^-h 2||X||i7’) 

||A 11^ ||A \\p 

< C{nlog{m))'^\\uk - VkhK 

Thus, \\Vf{U)-Vf{X)\\F<CXrn^log{m)^\\U-XO*\\F. □ 

It remains to certify a point in this region to initialize gradient descent. 

Lemma 3.11. Suppose we take m > C'/3A“'^||Af|||,nr^(logre)^ samples of the form (5), where Ai 
and Xr are as in (7). Define the matrix 

^ m 

1 = 1 


and the following quantities: 


U := [ui U2 ... 

a I 0 ... 0 

S := 0 (T2 ••• 0 

. 0 (Trj 

Uq := 


(T d 


where ai > (J 2 ... > (Jr+i > 0 are the eigenvalues of M and Ui are the corresponding normalized 
eigenvectors. Then with probability at least 1 — — 7/m^ we have that 


d{Uo) < 


9 

100||X||^ 


A^ 


(30) 


where d{U) is defined as 


d{U):= min \\XO-U\\l. 

0&0{r) 


For the proof, see §5.2. Initializing from a matrix satisfying (30) guarantees we are close enough so 
that gradient descent will converge. We can now prove the main theorem. Theorem 2.2: 

Proof of Theorem 2.2. Given the number of samples m > C/dXf^WXWpnrflogn), the following 
events simultaneously occur with the stated probability: 

• Lemma 3.11 holds, and thus d{Uo) := min( 9 gc>(^) ~ UqW'f < ioo||x|p 
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• Theorem 2.1 holds, and so considering the Taylor expansion of / around U, the following 

holds for all U satisfying d{U) < ■ 

{U-XO*,Vf{U))>i\\U-XO*fF, £=^. 

• Lemma 3.10 holds with Lipschitz constant B = Cn^log{nr^)‘^Xr. 

Let U~^ := Uq — 'yVf{Uo) and O* = argminogc)(^) W^O — C/o|||’- Then 

d{U^) < \\U+ -XO*\\l 

= \\Uo--fVf{Uo)-XO*\\l 

= \\Uo- XO*\\l - 2j{Vf{Uo), Uo - XO*) + 7'||V/(C/o) - V/(X)||| 

= \\Uo- XO*fF - 27(V/(C/o), Uo - XO*) + 7'l|V/(C/o) - V/(XO*)||| 

< \\Uo - XO*\\l - 2^£\\Uo - XO*\\l + -i^B‘^\\Uo - XO*\\l 
= [l-2-i£ + -i‘^B‘^]\\Uo-XO*\\l 

= [l-2^£ + ^‘^B^]d{Uo) (31) 

and, by induction, d{Uk) < [1 — 2j£ + 7^i?^]^d(?7o)- D 


3.4 The Complex Case 

Suppose we are in the classical phase retrieval setting of attempting to recover an unknown vector 
z G C"" via Gaussian measurements of the form aj + ibj where aj, bj ~ AA(0,1/2/d). If we write z 
as z = X + itc then a given measurement yj takes the form 

yj = \{ajx — bjw) + i{ajw + bjx)\^ 

= {ajx — b^wf' + {ajw + bjx)"^. 

Note that we can cast z as a matrix Z G ]^ 2 nx 2 


Z := 


X 

—w 


w 

X 


(33) 


and the measurement yj becomes 



bJ] ZZ^ 



where we note 



~ Af{0, l/2Id2nx2n)- 


Moreover, the columns of Z are orthogonal, and the map 


e 


ie 


cos 0 sin 0 
— sin 0 cos 6 


gives us an isomorphism onto the orientation preserving component of the orthogonal group 0{2). 
Thus this problem is equivalent to recovering an unknown real-valued rank 2 matrix, and the results 
in the previous sections reproduce known optimality guarantees for gradient descent in the phase 
retrieval model as found in e.g., [CLS14]. Specifically, we have shown the following: 
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Corollary 3.12. Given m > C'n(logn)^ noiseless samples of the form 


_ I I 2 

Vi = Kz\ 




where z € is an unknown vector, define 

:= [3f?(c 

and let Z be the output of Algorithm 1 applied to the data with constant step size 

7 < Cn-^. 

Then with probability at least 1 — 3e“^” — 7/mf we have that 


d(U) := min \\ZO - Z\\l < Cq 
oec>(2) 


l-^^+7'cV||x||^ 


where k is the number of iterations of gradient descent and Z is the matrix (33). 


4 Examples and Experiments 

First, we consider the performance of the algorithm (1) in the rank-1 real-valued setting, where the 
measurements are pi = {ajx)"^. Our numerical studies strongly suggest that the algorithm (1) is 
stable to noise, that is, given measurements of the form pi = + ho the algorithm successfully 

returns an matrix x up to the noise level < || 7 || 2 - We consider three different measurement 

ensembles: 


• Bernoulli: ai are i.i.d. Bernoulli random vectors 

• Standard Gaussian: ai are i.i.d. drawn from AA(0,/d^ixn)- 

• Gaussian with covariance: ai are i.i.d drawn from with covariance matrix 


S 


ij 


1, *=j 


In a first experiment, we fix an n-dimensional vector x of unit norm with randomly-generated 
coefficients, and consider noiseless measurements pi = {ajx)"^. We implement the meta-algorithm 1, 
calling Matlab’s built-in function fminunc to find a stationary point starting from the initialization. 
In the local optimization procedure, we do not provide any information to fminunc other than the 
function itself; by default Matlab uses a quasi-Newton method for local minimization. We run this 
experiment using the three different measurement ensembles above, at problem size n = 100 and 
at a number of measurements m = 2n, 3n,..., 8n. If the solution x recovered by the algorithm is 
within the tolerance min{||x — x|| 2 , ||x-|-x|| 2 } < .001, we say the algorithm has succeeded in finding 
the global solution. In Figure 1, the results of this experiment are displayed, averaged over 100 
trials. 


Next, we analyze numerically the stability of the algorithm to additive measurement noise. For 
these experiments, we consider noisy measurements of the form 

Pi = {ajxf + r]i 
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Figure 1: Phase transitions for exact recovery via Algorithm Initialize and Descend (1) using 
different random measurement ensembles. 




Figure 2: Performance of Algorithm Initialize and Descend (1) in the presence of additive noise 
Hi = {aJxY + 111 different signal-to-noise levels. Right: \\f]\\ 2 /Y^i{o% x)'^ = -5 and Left: 
ll*/E.(«P)" = 2. 


where rn are i.i.d. mean-zero uniformly distributed, and normalized such that \\r ]\\2 = /i|| 
for /X = .5 (low signal to noise ratio) and n = 2 (high signal to noise ratio). We observe that the 
meta-algorithm is robust to such additive noise, with relative reconstruction error min{||x—x|| 2 , ||ai+ 
a:|| 2 } averaging below the signal to noise threshold. We leave a theoretical analysis of this observed 
noise stability to future work. 

Finally, we test the performance of Algorithm 1 in the more general rank-r case. We consider 
noiseless measurements m = af XX'^ai where the ai are i.i.d. standard Gaussian and X G is a 
rank-5 matrix with orthogonal columns, normalized so that ||^||f = 1- We implement Algorithm 1, 
calling Matlab’s built-in function fminunc to find a stationary point starting from the initialization. 
In the local optimization procedure, we do not provide any information to fminunc other than the 
function itself; by default Matlab uses a quasi-Newton method for local minimization. We run this 
experiment at problem size n = 20 and m = 5n, lOn,..., 25n. We declare the algorithm to have 
converged to global solution if the matrix X G recovered by the algorithm satisfies 

IIXZV'^ - X\\f = min IIAO - AIR < .001 

Oeci(r) 
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Figure 3: Phase transitions for exact recovery via Algorithm Initialize and Descend (1) in recovering 
a rank-5 matrix from qnadratic Gaussian measurements. 

where = X^X is the singular value decomposition.n In Figure 1, the results of the experi¬ 

ment are displayed, averaged over 100 trials. 
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5 Appendix 

5.1 Proofs for §3.1 

5.1.1 Proof of Lemma 3.2 

Proof. The proofs of (18) and (19) use standard vector calculus. For (24), we use the fact that for 
any vector x G 

E[(x^a)^aa"^] = ||x||2/d + 2xx^ 


20 


which can be seen by writing out the entries of the matrix individually. This implies 


which, combined with 
yields the stated result. 


E[(a'^XX^a)aa'^] = ^ E[(a'^Xj)^aa'^] 

i=l 

r 

= {WxiWpd + 2xixf) 
i=l 


K[{a^Xi){a^Xj)aa?"] = XixJ + XjxJ + xjxjld 


□ 


5.1.2 Proof of Lemma 3.3 

We begin with a more general concentration result. 

Theorem 5.1. Let X G be a given matrix with orthogonal columns; suppose m > C5~‘^fjnr\og{n) 
where 6 and fd are given constants and r =rank(X). Then we have that with probability greater 
than 1 — — 6/m^ 


— ^ ® “ E[aiaf (g) Oiaf] 


2 = 1 


< 5 


where || • || i is the operator norm of the matrix restricted to the subspace spanned by the columns 

lx®K" 

of X tensored with'ST. 

Proof. Let x ® z G X ® M"' be an arbitrary unit vector. Write z = z_\_ + w, where w € X and 
{z±,x) = 0. We must consider the quantity 


\—J2^aIxf{aJzf - 2{x'^zf - l| 

I ^ I 

^ m ^ m 

- '^(.ajxf{afz± f - \\z±\\l + — '^{ajxfiafw)'^ - 2{x^w) 


2 = 1 


2 II ||2 

- \m\2 


2=1 


2=1 


m ' 


2 = 1 


First Term. Note that we have a product of independent subexponential random variables, and 
so if we condition on the bounds 


1 


m 


'^{afxf < 3 


2=1 


max (af xY < 91offm 

both of which happen with probability at least 1 — 1/m^, we find via Bernstein that 

^ ^(afx)2((afz^)2-||zu||i)| > 6/Q] < 2exp ( - cmin{^ ^ I) 


' m 


2=1 
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which we can make smaller than e so long as m > C5 ^anr log (n) and we conclude via an 
e-net argument that 

- m 

P[||—(8) (aiafPj_) — Pj_||| > <5/3] < -t-2/m^ (34) 

^ IxigiR" 

where Pj_ is the orthogonal projection onto the complement of X in M". 


Third Term. We recognize {ajz±) as a sub-gaussian random variable, and thus if we condi¬ 
tion on 

— < 16 (35) 

i=\ 


we find via Hoeffding that 


9 _ , 

Tfl . 

2 = 1 


cd^m^ 
16 ^ 


We can make this bound smaller than e so long as m > C6 “^arn. As (35) happens with 
probability greater than 1 — 1/m? we find 

P[||—0 (PaiafPi)||| > (5/3] < 6“^™-I-1/m^. (36) 


Second Term. For this term we further decompose w into its x-component and its x_L-component. 
We can apply the same analysis for the first and third terms to the x_\_ terms, and have only to 
deal with 

p | (x W _ 3(a.Ty^)2| > 

I ^ I 

2=1 

If we condition on 

max (ajx)^ < 81(logm)^ 
l<2<m 

and note that 

|3 — E[(afx)^|(afx)'^ < 81(logm)^] | < 1/m^ 

we find via Hoeffding that 



2 = 1 


E[...]| > (5/3] < exp ( 


—5‘^m? 


Cm + 81(logm)2(5m 


) 


(37) 


which we can make smaller than e so long as m > Ca5 ^r(logr)^. Moreover, note that we can 
also make (37) smaller than 1/m? for m > C5~^. We conclude that 

^ m 

E[||— QiaJ ® Qiaf — K[aiaJ ® aiof]|| | > <5/3] < 3min{e“^'’, 1/m^} -|- 3/m‘^. (38) 

,_i lx®x 


Combining (34), (38), and (36) yields the stated result. 


□ 
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Corollary 5.2. Suppose we collect m > C6 ^/3nrlog(n)^ samples of the form yi := afXX'^Oi, 
where 6 and j3 are given constants and r = rank(X); then we have that with probability greater than 
1 _ 2e-/5™ - 6/m2 

- y ViaiaJ - \\X\\yd - 2X*X*"' 

i=l 

Proof. Note that yi = and so by Theorem 5.1 we find that with probability greater 

than 1 — — Q/m? 


<5\\xy 


m 


yyjxkfajaj - \\xk\\lld - 2xkx'l 


i=l 


< ^\\Xk\\l 


op 


and so 


- y ymaf - \\X\\lld - 2XX^ 


2=1 


- ||~ “ \\xk\\ldd - 2xkxl\ 


op 


' m 

k=l 2=1 


I op 


<6\\X\\l. 


□ 


Corollary 5.3. Suppose we collect m > C6 ^/3nrlog(n)^ samples of the form yi := af XX'^at, 
where 6 and j3 are given constants and r = rank(X); then we have that with probability greater than 
1 _ - 6/m2 

\yfix)-Eyf{x)]y< 26 \\x\y 

Proof. Note that 

m 

V^f{X) = y [X^aiafX] ® a^af 
2 = 1 

and so we can use Theorem 5.1 to write 


X^ 0 Id 



m 

2=1 


' a? a 


2^2 



X0ld 




op 


(39) 

□ 


5.1.3 Proof of the Convexity Theorem 2.1 

We will rely on the following Lemma from [Ben03], as stated in [CLS14]: 

Lemma 5.4. Suppose Li, L 2 ) Xm are i.i.d. real-valued random variables obeying Yi < b for some 
nonrandom b > 0, E[yj] = 0, and IE[y] = v'^. Setting a'^ = m ■ max(6^,r;^), 

P[li +Y 2 + ... + Ym>y]< minjexp >co(l - 4'(y/cr))| 

where one can take cq = 25 and <h(-) is the CDF for the standard normal. 
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Proof of Theorem 2.1. Let W := VL/HPLUi? be the normalized direction from U to XO* and let 
t > 0 be a positive scalar. Moreover, WLOG we will be assuming that ||^||f = 1- 

Finally, because f{U) is invariant under the action of 0{r), it suffices to consider the case where 
O* = Id. 

Consider the single-variable function f(X + tW) which can be written 

/(t) = — ^ (2tafXW^ai + t^aJWW^a^ . 

2=1 

It is straightforward to verify that 

f"{t) =—^2,{aJWW'^a^^ + Q [ajWW^{ajXW^t + 2 [ajXW^ (40) 

2=1 

which is a convex polynomial in t; observe that f"{D) > 0. If the linear term is positive, then clearly 
(40) is positive for all t and we have nothing to show (the smallest eigenvalue is bounded below by 
f"{D) in the direction W). Define the following quantities: 

Ai := = ||ID^ai||i 

Bi := (afXW^a^ = {X^a^^W^af) 

and note that (40) can be written as 

{A,t + Bif - - 
m 



f"(t) = -Yl 

m 

i=l 


Consider the random variable 

Zi{t) := {Ait Bi)^ > 0 

Observe that Ai is a chi-squared random variable with 1 degree of freedom by the normalization 
||IF||i7 = 1. Thus we have 


E[A] = 1 

ml] = 3 

E[Af] = 105 
E[Af] = 10395 
E[2lf] = 2027025 
E[2iP] = 18602008425. 

By the definition of f"{t), we have 

/"(O) = vec(lT)^E [V'^fiX)] vec(lT) 

and so 

nsf] = ^vec(lT)^E [X^fiX]] vec(lT). (43) 
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Moreover, 


E[AiBi] = E 


= E 


'^{ajwkf '^{aj Xk){aJ Wk) 

L\fc=i / \fc=i / 


k,q=l 


^ 2{wlxq){wlwq) + \\Wk\\lxlwq 

k,q=l 


2 ^ X^Wkwlwq 


+ y~] x; w, 


T 

q '^q 


9=1 


k,q=l 
r 

= 2 ^ x^WW^Wq + tr(X^W) 

9=1 

= 2tr(X^VFVK'^VF) + tr(X^W) 

We then have that the mean ;u(t) := E[Zj(t)] is given by 

St^ + (^4tr(X'^WW^W) + 2tr(X'^W)) t + ^vec(W)^E [V^f{X)] vec(W). 
Next we consider the variance of Zi{t): 

E[iZ,it)-fkit)f]<E[Zfit)] 

= E [Af] + 4E [A^Bi] + 6E [A‘fBf] + 4E [AiBf] t + E [Bf] 


= 105r + 4E 


A|{X^a^,W^ai) 


+ 6E 


Af{X^ai,W^aif 


r + 4E 


Ai{X‘^ai,W^ai)^ 


t+ E 


(44) 


{X’^auW^ai)^ 


< 105t^ + 4WE[^6]E {XTai,W^ai)^ 


+ 6E[^3||^r^.||2j^2 ^ 4JE[Aj]E[{X^ai, W^ai)6]t + E 


iX'^a. 


'2 II 2 


IIW'^ 


where we used either Cauchy-Schwarz or Holder’s inequality on each term. Proceeding with ap¬ 
plications of Holder and Cauchy-Schwarz, recognizing that is also Chi-squared with one 

degree of freedom under the assumption that ||^||f = Ij and noting from (43) that E[i3?] < 3 we 
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have 


+ AyJ10395 ■E[Bf]t^ 


+ 6E[Af\\X^ai\\l]t‘^ + 4 J3 • E[{X'^ai, W'^ai)^]t + E \\X^' 


< 105t^ + 707^3 + oJE[A%XTai\\l]t^ + 4^3 • E[\\XTai\\l\\WTai\\l]t + Je \\X^ai\\l]E[\\WTai\\l 


< 105 t^ + 707^3 + 6 {E[A}^]E[\\X^ai\\l]Y^^ 

< 105t^ + 707^3 + 709if + 707t + 105 
= C{tf. 

Observe that the mean can be bounded as 

^{t) < 3t^ + 6t + 3Ai 

Now define 


^ai\\f]E[\\W^aiWf]] ' t + 105 


1/4 


y^{t) 

= fi{t) - Zi{t) 

b{t) 

= 32^ -|- 62 + 3A 

v\t) 

= C{tf 

a\t) 

= mC{t)‘^ 

y 

= mXr/12. 


Applying Lemma 5.4 above yields 

^ m 

f^it) - -Y^Z.it) > Xr/12 


2 = 1 


. . . , \lm 

< mm < exp I —: 


144 C ( t )2 


,25 l-cD( 


Xr\/rn 


12C{t) 


Using the well-known bound 


1 12(7(2) 

1 - 4>( ^ ^—^=77= exp 


Xl 


m 


12(7(2) XrV2rmT V 288(7(2)^^ 
we find that if m > 288Q;A7^(7(2)^nr then with probability at least 1 — we have 


ty m ^ m 

nt) = -^{Aa + B,f--^B, 

m m 


2=1 


2=1 


>3/6(2) Ar/4 y~l B. 


2=1 


> 22^ - 62 - Ar/4 + ^vec(lU)^E [V^f{X)] vec(lU) - ^vec{WfV^f{X)Yec{W) 


( 46 ) 
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where we used (44) to lower bound by 

t^-6t+ ^Yec{WfE [V^f{X)] vec(H") 

and the fact that 

1 ™ 1 

;jvec(VF)^VV(X)vec(iy). 

i=\ 

Moreover, an e-net argument over all directions W shows that (46) holds for an arbitrary W with 
probability at least 1 — . 

Further observe that our condition on m guarantees 

with probability at least 1 — — 6/m? by Corollary 5.3. This implies that 


-vec(lT)^E [V^f{X)] vec(lT) - -vec(lT)^V2/(X)vec(lT) > — 

2 2 8 

so that 

f{t) > 2t^ -6t + ^Xr 
8 

with probability at least 1 — — 6/m^ for any direction W. Thus by a tangent line bound we 

find that the smallest positive root of f”{t) is bounded below by 


15 3 

At* A^. 

- 48 10 


and we note that 


,15 


15 


/"(at A.) = 2(y.) 


M8 


48' 


X2 

“ 18' 


2a2 + ^(A.-A.) 

o 


Thus f”{t) > for all t £ [0, ^Ar] which is the advertised lower bound. 
For the upper bound, observe that by Cauchy-Schwarz 


m 


^(afWW^ 


Oi) [afXW^ai] < 6. 


i=l 


. J'laTWWTa, 


^ m 

, -ylaJXW^i 


and thus we find an upper bound for (40) is given by 

f"{t) < 3a^t^ + 6abt -|- 26^ 

where 


a = 


b = 


^ m 


- m 

-j:(ar.W^, 




(48) 
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Now, LOi := (^afWW^ai^ is a chi-squared random variable with one degree of freedom; conse¬ 
quently we find 

1 m 

P[- y wf > < r^^_cnr < ^-Cnr 

2 = 1 

as long as m > Cnr. Consequently an e-net argument shows us that for any W G ||hh||F = 1 

we have 

1 2 

- y 

2=1 

with probability greater than 1 — 

Returning to (48) we hnd then that 

< 3{at + b)'^ — 

< 3 {CnrXr + b)^ 


for all t G [ 0 , ^A^]. 

To finish the proof, observe that we can write 


1 


b= ^Vvec(lT)'^V 2 /(^)vec(lT) 

v2 


and by Corollary 5.3 (where, given the number of measurements m, we may take <5 > CXr/Xi) we 
have 


1 


b < ^\/vec(lT)^E [V 2 /(X)] vec(lT) + 2 Xr 

y/2 ^ 

- 


yielding 


f"{t) < C{n^r‘^Xl + Xr). 


From everything above, we conclude that for ^ 1^2 ^; 


1a2 ^ 

18 TllVlllJ 


U-X 


IXI 


< vec 


U-X 

IIXIIf 


vV 


< C ( n\‘^Xl ( 


Since 


X^f{U) = X^f{U-X) = ||X||^V 2 / 


( ^ 


VII^IIf 

’ ll^ll 

. /xx'r 

+ Ar ^ 

l^lll 

f u . 

X 


IXI 


IXI 


vec 


U-X 


U-X 


^IIf 

2 




(51) 
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we conclude that for general X, it holds for \\U\\f < io||x||j, 

vec {U - Xf VV {U; X) vec {U - X) > \\U - X\\l 

Yec{U - Xfv^f {U;X)Yec{U - X) < C \\U - Xfp (52) 

□ 


5.2 Proofs for 3.3, Initialization and Convergence 


Proof of Lemma 3.11. It suffices to prove the case = 1- By Corollary 5.2 we have that with 

probability greater than 1 — 2e“^'’"' — Q/m? 


m 


^ ymaj -Id- 2XX^ 


2=1 


< <5 


Op 


so long as m > C6 ^/3nr log(n)^. This implies 


Let 


|M - {l/2)Id - XX^W^^ < 6/2. 


A:= M - {l/2)Id 


and collect the unit normalized eigenvectors corresponding to the dominant r-dimensional subspace 
of A in a matrix U G Let ui > <72... > <7^+1 > 0 denote the eigenvalues of the observed 

matrix M and define 

(7l 0 ... 0 

Al 0 ^2 •** b CTy.-^^I dj. y. . 

. 0 ar 

L ' J j>xr 

Let Uq := and Q := OiOj where {X"^X)~^/‘^X"^U = OiZAOj is the singular value decom¬ 

position and observe 


[/SV2 _xQ = - U{X^Xfl‘^ + \J{X^Xfl‘^ - XQ 


< 


< 


< 


U - X{X^X)-^/‘^Q Q^{x'^xfl‘^ 


+ 


op 


op 


sV2 _ 


+ || e >/2 _ 


(53a) 

F 

(53b) 

(53c) 

(53d) 
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where (53c) follows from Theorem 2 in [YWS15] and (53d) follows from 


sV2 _ 


A ^ ^ I CTr+l \/~^i 

\ i=l 




^r+1 


\/o^ CTr+l T ''/^i 


< 


1 


< 


y/Xr 

1 




^(||^-XY^||^ + |a.+i-l/2|) 


1 




2A^ 

The first equality holds because X has orthogonal columns and thus X'^X is a diagonal matrix. 
Now, we have 

T{yFi+i) < Ia, 


20 


so long as 


5 < C{^/r) ^ min(A^/-v/XL, A^) = Cr ^/^A^ 
which gives the stated claim. 


□ 


5.3 Proofs for Rank-one Matrix Recovery, §3.2 
5.3.1 Proof of Lemma 3.5 

Proof. Note that by (19) we only need to compute E[(a'^ri)^aa'^] for an arbitrary u G M”. We will 
consider the slightly more general expectation K[{a^u){a^w)aa^] for arbitrary u,w ^ M”. Begin 
by assuming 'E = Id where Id is the n x n identity matrix. Let i,j G [n] be arbitrary coordinates. 
We have 

n 

¥,[{a^u){a^w)aj\ = E[a? a\ukWk + akajU^Wj] 

k=l k^j 

= lliUiWi + ^ UkWk 
k^i 

= (/X4 — l)UiWi + U^W 

and for i ^ j, 

n 

K[{a^u){a^w)aiaj] = E[ojaj a\ukWk + a-iOj a^ainkWi] 

k=l k^l 

= UiWj + WjUi 
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so that 

n 

K[{aFu){aFw)aaF] = {vFw)Id + uuF + wvF + — 3) UkWkekel. (56) 

k=l 

If S 7 ^ Id, then observe that if we define b := 

{JufaJ = 

then the inner term satisfies the assumptions needs for (56), and so we find 


W.[{JufaJ] = ( WlF/^^uWlld + + {^4 - 3) ^(S^/2n)|efce^ ) 

V k=l 

n 

= + 2Snn^S + (/X 4 — 3) ^^{v'^F'^Vkv'^ 


k=l 


where = [ui V 2 ... Vn]^ 


5.3.2 Proof of Lemma 3.9 

We begin with an eigenvalue bound. 

Lemma 5.5. Let u G M” he a unit vector and consider the traceless matrix 


rz T ST^ 2 T 

Z:=uu -^^Ui^ekCk- 


Suppose that 
Then we have 


k=l 


0<ul< ul_^ < ... < ul 


>^min{Z) G [-minjui, 1/2},-U 2 ] 
^max{Z) G [0, 1 — U^). 


Proof. Suppose first that > 0. Then we can use the determinant formula 

det (Z - Xld) = det ( '^{-ul - \)ekel | • (l-u^('^{ul + X)~^ekel 


u 


\k=l 


\k=l 


Yii-ul - A) • ( 1 - X] 


k=l 




which shows us the following: 

1. If any uj = Uj_f_i then A = —Uj is an eigenvalue. 


□ 


(57) 
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2. If all of the squared coordinates are distinct then each eigenvalue A satishes 


n 


E 



= 1 


and because there will be a vertical asymptote at each —we see the eigenvalues of Z 
interlace the squared coordinates, the smallest occurring somewhere between (—, —u\) and 
the largest somewhere after — 

In general, if some of the coordinates are 0, we see that with the assumed ordering Z will be a 
block matrix and we can apply (57) to the reduced space where Z acts nontrivially. 

The bound Xmin > —1/2 follows from 


min 

Il3/||2 = l = |b||2 



k=l 


Note that by the Gershgorin Circle Theorem, the largest eigenvalue Xmax is no larger than 1—u^. □ 

Proof of Lemma 3.9. Suppose that ||x ||2 = 1 and let 


5(/r) := Xmin Id + 2xx^ + (/r - 3) ^ xlete^ 

V k=l / 


Lemma 5.5 above shows g{l) > 1 — min{2r(x),l} and it is clear that g{3) = 1. By concavity of 
Xmin{-) we then hnd 


g{g,) > min{r(x), 1/2};U + 1 — 3min{r(x), 1/2} 
= 1 + min{/r — 3, 0} min{r(x), 1/2} 


for all fj. G [1, 3]. 

If /U 4 > 3, then 2xx^ + (/44 —3) Ylk=i ^ positive semi-definite matrix and thus g{fL) > 1. 

Observing that 

2g{g)\\xf2 = Xmin (E [X7^f{x)]) 

produces the bound (27). □ 


5.3.3 Proof of Lemma 3.6 


Proof. Begin by assuming T, = Id and reparametrize an arbitrary u G as u = x — tw for 
||t (;||2 = 1. Note that 


lit 

V^/(x — tw) = — (2(afx)^ — 6 (af x)(af w)t + 3(ajw)‘^t‘^) aiaj 

m 

2 = 1 


(59) 
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so using (56) we find 


E[V^f{x - tw)] = 3 


-6 


+ 2 


Id + 2ww'^ + (^4 — 3) ^ wlekel 

k=l 

{x^'w)Id + xw'^ + wx"^ + (//4 — 3) XkWke^e^ 


k=l 


\x\\li 


2 T 
Xk^kek 


(60) 


(61) 


\ld + 2xx^ + (^4 — 3) ^: 

Observe that 

n 

||x|||/d + 2xx'^ + (/i 4 — 3) x^CkG^ ^ (1 + niin{r(x), l/2}[/i4 — 3]_) ||x|||/d 

k=l 

and that 

n 

{x'^w)Id + xuF + wx^ + {fii — 3) XkWkBkel ^ (3 + t(x )[^4 — 3]+) ||x|| 2 /d. 

k=l 

For (60), we used Lemma 3.9. Lastly, note that 

n 

Id + 2wvF + — 3) ^ wlcke^ ^ 0. 

k=l 

Consequently we can define the polynomials 

Qy,w{t) := y'^E[V‘^f{x - tw)]y 

and by convexity we can bound the smallest positive root by the intercept of the tangent line; the 
bounds (60) and (61) thus yield the stated conclusion for S = Id. 

For general covariance matrices, note that we have just shown that 

whenever and are close enough, which implies 

E[V^f{u)] h 0. 

□ 


5.3.4 Proof of Lemma 3.8 

Proof. Begin by assuming E = Id and ||x ||2 = 1. Note that because of the sub-gaussian assumption 
we have that for m> C 

P [(afx)^ > clogm] < exp (1 — clogm) 

^ -4 

< m 

P [llailli ^ cnlogm] < 2exp (—clogm) 

^ -4 

< m 
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where the constants depend on the sub-gaussian norm of Oj. Consequently 


max {al xY\\ai \\2 > c^n(logm)^ 

l<i<m 


< 2m{m Y < 2m 


.-3 


and if we define the truncated random variables a, := aiX(a^a;)2||ai|||<c2n(iogm)2 the analogous 
truncated matrix 


M = — '^{afxYdiaf, 
m 

2=1 


Bernstein’s inequality (Theorem 4.1 in [Trol2]) tells us that 


M - E[M] 


> h 


op 


< nexp 


-5^m72 


cr^ + mn(logm)25/3 


where is given by 


; = 


E®[(‘ 


- (E [( 


^2 ^2^2 


2=1 


= m 


E[C 


a X) \\a\\r,aa 


-(E[( 


a X) aa 




'])■ 


op 


op 


(62) 


< Cmn 


where (7 is a constant which depends on the moments of a*. 

Lastly observe that if di := aiX(a^x)2||ai||^>c2n(iogm)2 then we can write 

E[M] - E[M]|[^ = ||E [{d^xYdd^] 

< E [(a^a;)^||a||2] 

pc^n{logm)^ poo 

= / P [(a^x)^||a ||2 > c^n(logm)^] dt+ 

Jo Jc- 


P [(a'^x)^||a||2 > t] dt 


c^n (log m)^ 

= c^n(logm)^ ^P [(a'^ 2 ;)^||a||| > c^n(logm)^] + J P [(a^x)^||a ||2 > ac^n(logm)^] da 
< 3/m^ 

(63) 

where we used Jensen’s inequality for the first line and have assumed m > Cn. Consequently we 
find that for m > Cn 


||M-E[M]7p>e <P M/M 


+ ■ 


M - E[M] 


> e 


< m{2m ) + nexp 


op 

—J^m^/2 


cr^ + mn(logm)^(5/3 


where 6 := e — ‘ijmJ from (63). Now, all we have left is to show that the exponential can be made 
less than a power of m. Using (62) we find that we need m to satisfy 

m > n (log n + 2 log m) ■ (C + (log m Yd) JY 
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for which it suffices to require m > Ce“^n(logn)^ for some constant C which only depends on the 
moments of a*. 

For the more general statement note that our previous work shows 


1 

m 


i=l 


E 






whenever m>Ce ^||S||Qpn(logn)^. Consequently, 


(64) 


— {afx)'^aiaf — E r(a'^x)^aa'^l 


2 = 1 


op 




2=1 


< IISI 


op 


m 




2=1 


op 


< e\\T}l'^x\ 


which is the desired claim. 


□ 


5.3.5 Proof of Theorem 2.3 

Proof of Theorem 2.3. Assume without loss that ||x ||2 = 1 and that w is the normalized direction 
from u to X. Moreover begin by assuming S = Id. Closely following the proof of Theorem 2.1 we 
first note that 

f”{t) = — + 6{a[ w)^ (of x)t + 2 {afx'uFaif‘ 


2 = 1 


and define 

where 


■— + Bi) > 0 


Ai := (^afww'^ai) 


Bi := iyafxvirai) . 

Note that by the computations done in Lemma 3.5 we have 
bL{f) ■.= E[Zi(t)] 


— (3 + (/i4 — 


w 


+ 2 ^3x'^rD + (/i4 — 3) x^w^ t + -r()^E[V^/(x)]u) 


< (3 + [fj,4 — 3]-|_) + (6 + 2[/i4 — 3]-|-) t + 

and that the variance 


3 + 2[/i4 — 3]-|_ 


where C{t) depends only on the subgaussian norm of the a*. 
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Now, for a given e G (0,1) define 


Y,{t) := fi{t) - Z,{t) 

b{t) := (3 + [^4 — 3]+) + (6 + 2[fi4 — 3]+) t + --^ 

v\t) := C{tf 
:= mC{t)‘^ 

y := mAmm(IE[V^/(a;)])/12 = mA/12. 


Applying Lemma 5.4 above yields 

^ m 

p ^it) —> A/12 
m 

i=i 

Using the well-known bound 

1 - 4 >( 


< min < exp I — 


X^m 


m , 12 C'(t) 

< ^— 7 == exp 


144C(f)2 J ’ 

A^m 


25 1 - 4>( 


m 


l2C{t) 


\2C{t) X\/2m7T 


288C(ty 


we find that if m > 288aA ^C(t)^n then with probability at least 1 — e “"we have 

^ m ^ m 

m m 

i=l 

^ m 

>3M(t)-A/4--^i?2 

m i ^ 


2=1 


2=1 


W 


> (9 -|- 3 [/i 4 — 3]_)t^ -|- ( 6 [/i 4 — 3]_ — 3)t -|- ib'^E [V^/(x)] w + [^^/(^)] ^ ~ 2 

> (9 + 3[//4 — 3]_)t^ -|- ( 6 [/i 4 — 3]_ — 3)t -|- w'^E [V^/(x)] w — X/A — e/2 

> (9 -h 3[//4 - 3]_)t2 (6[/i4 - 3]_ - 3)t -h A/2 

(67) 

where we used the concentration guaranteed by Lemma 3.8 above with e < A /8 and the fact that 


^f2Bf = lw^Vyix)w. 

m 2 


2=1 


Consequently, using a tangent line bound for the smallest positive root we find that for all 


° ^ ^ 6 - 12[^4 - 3]_ 

we have 

fit) > Xfl2. 

Moreover, an e-net argument over all directions w shows that (67) holds for an arbitrary w with 
probability at least 1 — e“^". 

For general covariance matrices, apply the previous argument to and with mea¬ 
surements bi = as usual. 

□ 


'vV(a:)u’ 
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